Modeling Chinese Documents with Topical Word-Character Models

Authors

  • Wei Hu
  • Nobuyuki Shimizu
  • Hiroshi Nakagawa
  • Huanye Sheng
Abstract

As Chinese text is written without word boundaries, effectively recognizing Chinese words is like recognizing collocations in English, substituting characters for words and words for collocations. However, existing topical models that involve collocations have a common limitation. Instead of directly assigning a topic to a collocation, they take the topic of a word within the collocation as the topic of the whole collocation. This is unsatisfactory for topical modeling of Chinese documents. Thus, we propose a topical word-character model (TWC), which allows two distinct types of topics: word topic and character topic. We evaluated TWC both qualitatively and quantitatively to show that it is a powerful and promising topic model.

Similar resources

Topic Modeling of Chinese Language Using Character-Word Relations

Topic models are hierarchical Bayesian models for language modeling and document analysis. They have been widely used and have achieved considerable success in modeling English documents. However, unlike English and the majority of alphabetic languages, the basic structural unit of the Chinese language is the character rather than the word, and Chinese words are written without spaces between them. Most previous research...

Strategies of Processing Japanese Names and Character Variants in Traditional Chinese Text

This paper proposes an approach to identify word candidates that are not Traditional Chinese, including Japanese names (written in Japanese Kanji or Traditional Chinese characters) and word variants, when doing word segmentation on Traditional Chinese text. When handling personal names, a probability model concerning formats of names is introduced. We also propose a method to map Japanese Kanji...

Improving English and Chinese Ad-Hoc Retrieval: TIPSTER Text Phase 3 Final Report

We investigated both English and Chinese ad-hoc information retrieval (IR). One of our objectives is to study the use of term, phrasal and topical concept level evidence, either individually or in combination, to improve retrieval accuracy. For short queries, we studied five term level techniques that together lead to improvements of some 20% to 40% over standard ad-hoc 2-stage retrieval for TREC...

Okapi Chinese Text Retrieval Experiments at TREC-6

The focus of the Okapi TREC-6 Chinese experiments is on investigating the effectiveness of different automatic indexing methods and phrase weighting for retrieval based on probabilistic models over Chinese text. We compare different probabilistic weighting methods based on a range of word and single character approaches. There are two indexing methods used in our experiments. One indexing method i...

Text classification in Asian languages without word segmentation

We present a simple approach for Asian language text classification without word segmentation, based on statistical n-gram language modeling. In particular, we examine Chinese and Japanese text classification. With character n-gram models, our approach avoids word segmentation. However, unlike traditional ad hoc n-gram models, the statistical language modeling based approach has strong information...

Publication date: 2008